
[Bugfix] Fix guided decoding with tokenizer mode mistral #11046

Merged

Conversation

@wallashss (Contributor) commented Dec 10, 2024

This PR addresses the issues reported in #11045.

Changelog:

  • Fix the Mistral tokenizer when its decode method is overwritten by the outlines processors: the adapter replaces this method, so when the tokenizer calls it, it gets a wrong response (see the illustrative sketch below).
  • Added support for the Mistral tokenizer with XGrammar.
  • Added tests to cover the Mistral tokenizer mode.
  • Minor refactor of the XGrammar decoding.
  • Fixed the tests in test_guided_processors, which were not being executed, and added them to CI to validate the changes.

FIX #11045
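
To illustrate the failure mode behind the first changelog item, here is a generic, self-contained sketch (this is not vLLM's or outlines' actual adapter code; the class and function names are hypothetical). It shows why overriding decode on a shared tokenizer instance changes the behavior seen by every other caller, and why adapting a copy avoids that:

```python
import copy


class DummyTokenizer:
    """Hypothetical stand-in for a tokenizer whose decode() others rely on."""

    def decode(self, token_ids: list[int]) -> str:
        return " ".join(f"<tok{i}>" for i in token_ids)


def adapt_for_guided_decoding(tokenizer: DummyTokenizer) -> DummyTokenizer:
    """Outlines-style adaptation: replace decode() with a different contract."""
    original_decode = tokenizer.decode

    def per_token_decode(token_ids: list[int]) -> list[str]:
        # Note the changed return type: a list of strings, not one string.
        return [original_decode([tid]) for tid in token_ids]

    # If `tokenizer` is the shared instance, every other component calling
    # tokenizer.decode() now gets this wrapped behavior, i.e. the "wrong
    # response" described in the changelog.
    tokenizer.decode = per_token_decode
    return tokenizer


shared = DummyTokenizer()
# Safer: adapt a copy so the shared tokenizer keeps its original decode().
adapted = adapt_for_guided_decoding(copy.copy(shared))

assert isinstance(shared.decode([1, 2]), str)    # unchanged
assert isinstance(adapted.decode([1, 2]), list)  # wrapped behavior
```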


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Dec 10, 2024
@mgoin (Member) left a comment


Thanks for working on this and trying to clean up along the way. My main concern is with using the tokenizer's vocab size - I think this may not be padded to the model's final logit size. Also the backend_str/vocab_type logic is a little confusing now.
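
For illustration only, here is the kind of mismatch this concern is about (the numbers are hypothetical; whether and how much a given model pads its logits is model-specific):

```python
# Hypothetical sizes: the model's output (logit) dimension is often padded,
# e.g. to a multiple of 64, so it can exceed the tokenizer's raw vocabulary.
tokenizer_vocab_size = 32_000   # what the tokenizer reports
model_logits_size = 32_064      # e.g. hf_config.vocab_size after padding

# A grammar-backed logits processor has to mask a vector with
# model_logits_size entries; sizing it from tokenizer_vocab_size would leave
# the padded tail unaccounted for, or trigger a shape mismatch.
padding = model_logits_size - tokenizer_vocab_size
print(f"{padding} logit slots are not covered by the tokenizer vocabulary")
```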

Comment on lines 171 to 189
- vocab_size=model_config.hf_config.vocab_size,
+ vocab_size=tokenizer.vocab_size,
Member

@aarnphm there is a reason why we needed to reference the model's vocab size and not the tokenizer's, correct?

Contributor Author

Cool, that was a shot in the dark; I figured someone could clarify this. I already have a conflict on this part since #11043 was recently merged, so I will revert this change. But it would be nice to document the difference between these vocab sizes. @aarnphm do you have any reference for that?

@wallashss (Contributor Author) commented Dec 10, 2024

Thanks for checking it out, @mgoin! I really appreciate how fast you responded to this.

> Also the backend_str/vocab_type logic is a little confusing now.

Yeah, but the problem with this part is that it was not correct. I reported a traceback in #11045:

from_huggingface(): incompatible function arguments. The following argument types are supported: (arg0: list[str], arg1: str, arg2: Optional[int], arg3: Optional[list[int]]) -> xgrammar.xgrammar_bindings.TokenizerInfo

First, backend_str is typed as str but was being assigned an enum value (VocabType). Furthermore, from_huggingface does not accept a VocabType, and that's why it crashes with the Mistral tokenizer, and possibly with other kinds of tokenizers as well. So, for these cases we have to call the constructor of TokenizerInfo and pass a VocabType (and no backend_str), which may be RAW for tokenizers like tiktoken and BYTE_FALLBACK for Mistral. I think I can make it more explicit: the first time I wrote a solution for this, I added a bool field on TokenizerData such as is_from_huggingface, but I decided to make it more concise and infer that from the values of these fields (which must be mutually exclusive). Of course, I can add more comments to clarify and add asserts to prevent wrong behavior.
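
To make that branching concrete, here is a minimal sketch of the idea (illustrative only, not the PR's actual code; the xgrammar call signatures are assumptions based on the traceback above and may not match a given xgrammar release):

```python
import xgrammar as xgr


def build_tokenizer_info(encoded_vocab: list[str],
                         backend_str: str | None,
                         vocab_type: xgr.VocabType | None,
                         vocab_size: int | None = None,
                         stop_token_ids: list[int] | None = None):
    # Exactly one of backend_str / vocab_type is expected to be set.
    assert (backend_str is None) != (vocab_type is None)

    if backend_str is not None:
        # Hugging Face-style tokenizers: the argument order mirrors the
        # binding signature reported in the traceback above.
        return xgr.TokenizerInfo.from_huggingface(encoded_vocab, backend_str,
                                                  vocab_size, stop_token_ids)

    # Tokenizers without a usable backend string (Mistral, tiktoken-like):
    # construct TokenizerInfo directly with an explicit VocabType, e.g.
    # xgr.VocabType.RAW or xgr.VocabType.BYTE_FALLBACK.
    return xgr.TokenizerInfo(encoded_vocab,
                             vocab_type=vocab_type,
                             vocab_size=vocab_size,
                             stop_token_ids=stop_token_ids)
```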

What do you think?

@aarnphm (Contributor) commented Dec 10, 2024

We need to reference the tokenizer vocab size because of additional padding tokens.

This is the thread: https://vllm-dev.slack.com/archives/C07QQ8DAXMK/p1732673561777159

Contributor Author

Thanks! I reverted the code.


mergify bot commented Dec 10, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wallashss.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 10, 2024
@mergify mergify bot removed the needs-rebase label Dec 10, 2024
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
@aarnphm (Contributor) left a comment

Just a few final requests from my end; otherwise LGTM.

@@ -115,25 +116,9 @@ async def get_guided_decoding_logits_processor(
def get_local_guided_decoding_logits_processor(
        guided_params: GuidedDecodingParams, tokenizer: PreTrainedTokenizer,
        model_config: ModelConfig) -> LogitsProcessor | None:
    guided_params = maybe_backend_fallback(guided_params)
Contributor

Can you revert this change?

This is used for the offline use case, with LLM, whereas get_guided_decoding_logits_processor is used for the online use case.
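
For context, a minimal sketch of the offline use case being referred to (assuming the GuidedDecodingParams / SamplingParams API available around this PR; the model name is just an example):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Offline (local) path: the LLM entrypoint ends up in
# get_local_guided_decoding_logits_processor rather than the async/online one.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", tokenizer_mode="mistral")

guided = GuidedDecodingParams(choice=["yes", "no"])
params = SamplingParams(max_tokens=5, guided_decoding=guided)

outputs = llm.generate(
    ["Is Paris the capital of France? Answer yes or no."],
    sampling_params=params)
print(outputs[0].outputs[0].text)
```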

Contributor Author

I reviewed what I did and saw that it was not great, given the differences between the implementations of get_local_outlines_guided_decoding_logits_processor and get_outlines_guided_decoding_logits_processor. But I tried something a little different rather than reverting everything, just to avoid code duplication. See if you agree; if not, I won't insist and can revert with no problem. I also updated the tests to check both the offline and online versions so that they exercise all of these code paths, including the offline path.

Member

I think this is even more confusing now that there are three functions. I would prefer a revert, since it seems you have no other changes to this file. We can consider a refactor in another PR.

Signed-off-by: Wallas Santos <[email protected]>

mergify bot commented Dec 12, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wallashss.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 12, 2024
@mergify mergify bot removed the needs-rebase label Dec 12, 2024
Comment on lines +46 to +55
    # These fields are mutually exclusive: `backend_str` is used to create a
    # TokenizerInfo with `TokenizerInfo.from_huggingface`, while `vocab_type` is
    # used within the constructor of TokenizerInfo.
    backend_str: str | None = None
    vocab_type: xgr.VocabType | None = None

    def __post_init__(self):
        # Check for mutual exclusivity
        assert not (self.backend_str and self.vocab_type), \
            "backend_str and vocab_type are mutually exclusive"
Contributor

Hmm, is it possible to just use either backend_str or vocab_type here?

The cartesian product is probably harder to document.

Contributor Author

> Hmm, is it possible to just use either backend_str or vocab_type here?

The way get_tokenizer_data was implemented, I don't see another way. The from_huggingface method from xgrammar has several paths, and depending on the type of the tokenizer it needs either backend_str or vocab_type.

> The cartesian product is probably harder to document.

I'm sorry, what do you mean by cartesian product? Can you elaborate?

Contributor

There are two mutually exclusive options, which will yield different combinations across different models.

Considering how many models vLLM currently supports, we would have to draw a table of which combinations work and which don't, hence the cartesian product of combinations.

Contributor Author

OK, got it. Do you think documentation is required for this PR? My primary goal was to address the issue with guided decoding when using tokenizer_mode mistral, as mentioned in the related PR. Without this fix, vLLM crashes with xgrammar due to an unhandled code path.

In my opinion, it might be beneficial to apply these fixes and then assess whether extensive documentation or code checks are required to help users understand unsupported features. What do you think?

@mgoin (Member) left a comment

Nice, I think this LGTM once you remove the xgrammar import in the test and revert vllm/model_executor/guided_decoding/__init__.py.

from vllm.model_executor.guided_decoding.outlines_logits_processors import (
    JSONLogitsProcessor, RegexLogitsProcessor)
from vllm.model_executor.guided_decoding.xgrammar_decoding import TokenizerData
Member

Importing from xgrammar_decoding.py will also import and require xgrammar, in addition to your import xgrammar as xgr line below. I would recommend making your test_pickle_xgrammar_tokenizer_data function skip if xgrammar cannot be imported, and simply including these two imports within that function.

In fact, it may be best to add a dedicated xgrammar testing file.
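
A sketch of the suggested pattern (assuming pytest; TokenizerData's fields are taken from the diff discussed above, and any other fields are assumed to have defaults):

```python
import pickle

import pytest


def test_pickle_xgrammar_tokenizer_data():
    # Skip when xgrammar is unavailable, and keep the xgrammar-dependent
    # imports local to the test so the module itself imports cleanly.
    try:
        import xgrammar as xgr
        from vllm.model_executor.guided_decoding.xgrammar_decoding import (
            TokenizerData)
    except ImportError:
        pytest.skip("Could not import xgrammar to run this test")

    data = TokenizerData(vocab_type=xgr.VocabType.RAW)
    roundtripped = pickle.loads(pickle.dumps(data))
    assert roundtripped.vocab_type == xgr.VocabType.RAW
```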

Member

Thanks for the change, but you need to move `from vllm.model_executor.guided_decoding.xgrammar_decoding import TokenizerData` into the try-except as well.

@@ -115,25 +116,9 @@ async def get_guided_decoding_logits_processor(
def get_local_guided_decoding_logits_processor(
        guided_params: GuidedDecodingParams, tokenizer: PreTrainedTokenizer,
        model_config: ModelConfig) -> LogitsProcessor | None:
    guided_params = maybe_backend_fallback(guided_params)
Member

I think this is even more confusing now that there are three functions. I would prefer a revert, since it seems you have no other changes to this file. We can consider a refactor in another PR.

Signed-off-by: Wallas Santos <[email protected]>
@wallashss (Contributor Author)

Thanks for the review @mgoin @aarnphm.

> I think this is even more confusing now that there are three functions. I would prefer a revert, since it seems you have no other changes to this file. We can consider a refactor in another PR.

NP, it made more sense to me when I tried to remove the model_config.

> I would recommend making your test_pickle_xgrammar_tokenizer_data function skip if xgrammar cannot be imported, and simply including these two imports within that function.

Added the conditional skip.

> In fact, it may be best to add a dedicated xgrammar testing file.

Agreed. I added a TODO; I guess it makes sense to do that in another PR.

@mgoin (Member) left a comment

Nice LGTM!

@mgoin mgoin added the ready label Dec 16, 2024
@DarkLight1337 (Member) commented Dec 17, 2024

It appears that AMD CI isn't using the correct version of outlines, or one of its dependencies.

@wallashss (Contributor Author) commented Dec 17, 2024

Hey @DarkLight1337, thanks for checking this.

I did a quick search and I think there is a conflict with the lark library. I could reproduce this import error in my environment if I do a pip install lark-parser (I found similar issues by googling it). I am not sure what is going on in this AMD image, and unfortunately I cannot pull the image to investigate.

@wallashss (Contributor Author) commented Dec 17, 2024

I checked the CI image for AMD and found this:

lark==0.12.0

In my environment I have:

lark==1.2.2

I'm not sure of the best way to fix this besides pinning it in the requirements... Any suggestions?

@DarkLight1337 (Member)

I guess you just have to update the requirements file.
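
For illustration, the kind of pin being discussed would look something like this (the exact requirements file is an assumption; the version comes from the environment reported above):

```text
# e.g. in requirements-common.txt (hypothetical placement)
lark == 1.2.2
```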

Signed-off-by: Wallas Santos <[email protected]>

mergify bot commented Dec 17, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wallashss.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 17, 2024
@mergify mergify bot removed the needs-rebase label Dec 17, 2024
@wallashss (Contributor Author)

@DarkLight1337 updated it. Let's see what CI says now.

@simon-mo simon-mo merged commit 8b79f9e into vllm-project:main Dec 18, 2024
72 of 75 checks passed
SageMoore pushed a commit to neuralmagic/vllm that referenced this pull request Dec 19, 2024
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues.

[Bug]: Guided decoding crashing when tokenizer_mode is set to mistral
5 participants